Introduction:

A recent study published by the Harvard Injury Control Research Center shows that the frequency of mass shootings has increased over time. Some of the deadliest mass shootings in the U.S. happened in 2018, including the Parkland high school shooting in February and the massacre at the Tree of Life synagogue in Pittsburgh on October 27, both of which prompted widespread national horror. The death toll is not irrelevant to shooters’ motives: criminologists who have studied them believe that body counts play a role in their calculations. Mass shootings also say a lot about a society, a country, and how well we are making progress in preventing loss of life.

Data Preprocessing

Load the data from source.

mass_shootings <- read.csv("~/STA 518/Preparations/Using-a-Machine-Learning-to-Predict-US-Mass-Shooting/Mass Shootings Dataset Ver 5.csv")

Library

Let's import all the libraries needed for data cleaning as well as data visualization.

library(data.table) # A faster way to handle data frames in R,Creating data frames
library(readr) # read_csv function file I/O
library(dplyr) # Data Manipulation
library(janitor) # perfectly format data.frame column
library(tidyr) # tidy manipulation
library(tidyverse) #for data cleaning
library(stringr) # to manipulate character type data variables
library(reshape2) # easy to transform data between wide and long formats.
library(lubridate) # For easy handling dates
library(tm) # Text Mining
library(wordcloud) # Form Wordcloud
library(SnowballC) # goes well with wordcloud
library(ggthemes) # For prettier ggplot2 plot aesthetics and accessibility for color-blind palettes
library(plotly) # for interactive visualizations
library(ggplot2) # Data visualization
library(knitr) # For pretty tables
library(scales) # To add more ticks and facilitate plot interpretation
library(lattice) #  powerful and elegant data visualization 
library(chron) # Provides chronological objects which can handle dates and times.
library(cowplot)  # Subplots
library(maptools) # Tools for Handling Spatial Objects
library("leaflet") #For Maps
library(maps) # for map visualization
library(kableExtra)
## Error: package or namespace load failed for 'kableExtra':
##  .onLoad failed in loadNamespace() for 'kableExtra', details:
##   call: !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output %in% 
##   error: 'length = 2' in coercion to 'logical(1)'
library(DT) # interactive data table
library(gtable) # construct and manipulate layouts containing graphical elements.
library(grid) # rewrite of the graphics layout capabilities, plus some support for interaction.
library(formattable) #Table
library(ggrepel) # ggrepel provides geoms for ggplot2 to repel overlapping text labels
library(forcats) # Categorical Data Wrangling
library(exploreR)

Clean columns.

The first step is to clean the variable names so that they are easy to work with.

mass_shootings <- clean_names(mass_shootings)
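As a quick illustration of what clean_names() does (a hypothetical two-column frame, not the real dataset):

```r
library(janitor)

# Hypothetical example: headers with spaces and capitals become snake_case
demo <- data.frame("Total Victims" = 1, "Mental Health Issues" = "No",
                   check.names = FALSE)
names(clean_names(demo))  # "total_victims" "mental_health_issues"
```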

Data Modelling

The dataset contains Serial No, Title, Location, Date, Summary, Fatalities, Injured, Total Victims, Mental Health Issue, Race, Gender, and Lat-Long information. All the variables are quite self-explanatory.

Duplicate data

duplicate_data <- mass_shootings[duplicated(mass_shootings$s), ] # after clean_names() the serial-number column is named "s"

# Display duplicate rows
print(duplicate_data)
##  [1] s                    title                location            
##  [4] date                 incident_area        open_close_location 
##  [7] target               cause                summary             
## [10] fatalities           injured              total_victims       
## [13] policeman_killed     age                  employeed_y_n       
## [16] employed_at          mental_health_issues race                
## [19] gender               latitude             longitude           
## <0 rows> (or 0-length row.names)
# Number of duplicate rows
print(nrow(duplicate_data))
## [1] 0

We don’t have any duplicate rows; every incident in our dataset is unique. Let's also see whether the incidents are recorded in chronological order.
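The chronological-order check can be sketched in base R. Since the date column is still character at this point it must be parsed first; a toy vector stands in for mass_shootings$date here:

```r
# Toy stand-in for mass_shootings$date (mm/dd/yyyy, most recent first)
dates_chr <- c("11/5/2017", "11/1/2017", "10/18/2017")
dates <- as.Date(dates_chr, format = "%m/%d/%Y")

# The dataset lists the most recent incident first, so test for
# reverse-chronological (non-decreasing when reversed) order:
!is.unsorted(rev(dates))  # TRUE for this toy vector
```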

Let's take a quick look at our dataset. Peek-a-boo!

# Let's take a peek at our dataset
glimpse(mass_shootings)
## Rows: 323
## Columns: 21
## $ s                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ title                <chr> "Texas church mass shooting", "Walmart shooting i…
## $ location             <chr> "Sutherland Springs, TX", "Thornton, CO", "Edgewo…
## $ date                 <chr> "11/5/2017", "11/1/2017", "10/18/2017", "10/1/201…
## $ incident_area        <chr> "Church", "Wal-Mart", "Remodeling Store", "Las Ve…
## $ open_close_location  <chr> "Close", "Open", "Close", "Open", "Close", "Close…
## $ target               <chr> "random", "random", "coworkers", "random", "cowor…
## $ cause                <chr> "unknown", "unknown", "unknown", "unknown", "", "…
## $ summary              <chr> "Devin Patrick Kelley, 26, an ex-air force office…
## $ fatalities           <int> 26, 3, 3, 59, 3, 3, 5, 3, 3, 5, 5, 3, 5, 49, 0, 1…
## $ injured              <int> 20, 0, 3, 527, 2, 0, 0, 0, 0, 6, 0, 3, 11, 53, 4,…
## $ total_victims        <int> 46, 3, 6, 585, 5, 3, 5, 3, 3, 11, 5, 6, 16, 102, …
## $ policeman_killed     <int> 0, 0, 0, 1, 0, NA, NA, 1, NA, NA, NA, 3, 5, 0, 0,…
## $ age                  <chr> "26", "47", "37", "64", "38", "24", "45", "43", "…
## $ employeed_y_n        <int> NA, NA, NA, NA, 1, 1, 1, 1, NA, NA, NA, NA, NA, N…
## $ employed_at          <chr> "", "", "Advance Granite Store", "", "", "Weis gr…
## $ mental_health_issues <chr> "No", "No", "No", "Unclear", "Yes", "Unclear", "U…
## $ race                 <chr> "White", "White", "Black", "White", "Asian", "Whi…
## $ gender               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",…
## $ latitude             <dbl> NA, NA, NA, 36.18127, NA, NA, NA, NA, NA, NA, NA,…
## $ longitude            <dbl> NA, NA, NA, -115.13413, NA, NA, NA, NA, NA, NA, N…

Let's clean up our data before we visualize it. There are a lot of variables that can be imputed; we also need to handle missing values and create new variables.

# Type conversion
mass_shootings$date <- lubridate::mdy(mass_shootings$date) # convert date from character to Date
mass_shootings$summary <- as.factor(mass_shootings$summary) # convert summary to a factor variable

Map

Let's plot the incidents on a map to get a sense of where they occurred.

world.map <- map_data("state")
ggplot() + 
  geom_map(data = world.map, map = world.map,
           aes(x = long, y = lat, group = group, map_id = region),
           fill = "white", colour = "black") + 
  geom_point(data = mass_shootings, 
             aes(x = longitude, y = latitude, size = fatalities),
             color = "firebrick", alpha = 0.6) +
  xlim(-130, -65) +
  ylim(25, 50) +
  labs(title = "All Mass Shootings that occurred between 1966 and 2017",
       subtitle = "Size represents the fatalities") +
  theme_fivethirtyeight()
## Warning in geom_map(data = world.map, map = world.map, aes(x = long, y = lat, :
## Ignoring unknown aesthetics: x and y
## Warning: Removed 22 rows containing missing values (`geom_point()`).

Date

Let's start with the date column. We will separate it into year, month, day, weekday, day-of-week, and week variables. According to the research, the average number of days separating mass shootings went from roughly 200 during the period 1983 to 2011 to 64 since 2011.

mass_shootings <- mass_shootings %>% 
  dplyr::mutate(year = year(date), 
                month = month(date),
                month_name = month.abb[month],
                day = day(date),
                Weekday = weekdays(date),
                dow = lubridate::wday(date),
                Week = week(date))

Title

Let's look at some of the most severe mass shootings in U.S. history from our dataset. The dataset runs up to the Las Vegas shooting.

  • The word cloud below shows where most of the mass shootings occurred.
inci_dents <- mass_shootings %>%
              select(title,location,total_victims) %>%
              arrange(desc(total_victims)) %>% 
              head(10) # Lets pick only top ten otherwise it would be overcrowded.

inci_dents <- inner_join(inci_dents,mass_shootings,by=c("location","title")) 
#Select column of interest
inci_dents <- inci_dents %>% 
                select(title,location,total_victims.x,month,day,year,fatalities,injured)

formattable(inci_dents,align="l",list(total_victims.x=color_tile("lightsalmon","red"),fatalities=color_bar("cyan"),injured=color_bar("darkorchid")))
title location total_victims.x month day year fatalities injured
Las Vegas Strip mass shooting Las Vegas, NV 585 10 1 2017 59 527
Orlando nightclub massacre Orlando, Florida 102 6 12 2016 49 53
Aurora theater shooting Aurora, Colorado 82 7 20 2012 12 70
Virginia Tech massacre Blacksburg, Virginia 55 4 16 2007 32 23
University of Texas at Austin Austin, Texas 48 8 1 1966 17 32
Texas church mass shooting Sutherland Springs, TX 46 11 5 2017 26 20
Fort Hood Army Base Fort Hood, Texas 45 11 5 2009 13 32
Luby’s Cafeteria in Killeen, Texas Killeen, Texas 43 10 16 1991 24 20
McDonald’s restaurant in San Ysidro San Ysidro, California 40 7 18 1984 22 19
Columbine High School Littleton, Colorado 37 4 20 1999 15 24
# Word Cloud
Corpus <- Corpus(VectorSource(mass_shootings$title)) # build a corpus from the incident titles
Corpus <- tm_map(Corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(Corpus, removePunctuation): transformation drops
## documents
Corpus <- tm_map(Corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(Corpus, stripWhitespace): transformation drops
## documents
Corpus <- tm_map(Corpus, removeWords, stopwords('english'))
## Warning in tm_map.SimpleCorpus(Corpus, removeWords, stopwords("english")):
## transformation drops documents
Corpus <- tm_map(Corpus, removeWords, 'shooting')
## Warning in tm_map.SimpleCorpus(Corpus, removeWords, "shooting"): transformation
## drops documents
wordcloud(Corpus,  scale=c(5,0.5), max.words = 200, use.r.layout=FALSE, 
          rot.per = 0.3, random.order = FALSE, colors=brewer.pal(8, 'Dark2'))

Like mass shootings in general, school shootings have gone from being a rare tragedy to a tragic reality. Already in 2018 there have been more than a dozen instances of gun violence in U.S. schools. We may be unsure where to even begin with such a heavy topic; consider asking our kids what their questions are before giving our two cents. And yet, there are many individuals who suffered childhood and adulthood traumas and severe abuse and did not become rampage shooters; they developed healthy relationships with others. Are school shooters unable to develop healthy relationships with others?

Total Fatalities

Let's look at the total number of fatalities over the years, using a line graph and a bar plot side by side.

# 1. Plot number of shootings by year and total victims
fatalities_yr<- mass_shootings %>% 
                      select(year, fatalities, injured, total_victims) %>%
                      melt(id.vars=c("year"), measure.vars = c("fatalities","injured","total_victims"))

p <- ggplot(data = fatalities_yr %>% 
        group_by(year, variable) %>% 
        summarize(shoot_ings = length(year),fatalities = sum(value)),
             aes(x=year, y=shoot_ings)) + 
  geom_line(size=.5,color="black") +
  geom_point(shape=18,size=1, fill="black",stroke=1.2, color = "black") +
  theme_fivethirtyeight() +
  scale_fill_gdocs() +
  ggtitle("Total Mass Shooting \n between 1966-2017") 
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
q <- ggplot() +
  geom_bar(data = fatalities_yr %>%
             filter(variable == "total_victims") %>%
             group_by(year, variable) %>%
             summarize(shoot_ings = length(year), fatalities = sum(value)),
           aes(x=year, y=fatalities, fill = variable), stat="identity") +
  theme_fivethirtyeight() + 
  guides(fill = FALSE)+ # Remove legends
  ggtitle("Total victims 1966-2017") 
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot_grid(p, q, labels = c('A', 'B')) # We can see side by side.

It is clear from the bar graph above that shootings have been increasing over the years. How does gender play a role in it?

We will bucket gender into 4 categories, Male, Female, Male/Female, or Unknown, to keep our analysis simple.

# Gender
# Before Clean up 
#table(mass_shootings$gender) # Uncomment to see how it looks before.

# Let's recode to Male, Female, Male/Female and Unknown.
mass_shootings$gender[mass_shootings$gender=="M"] <- "Male"
mass_shootings$gender[mass_shootings$gender=="F"] <- "Female"
mass_shootings$gender[mass_shootings$gender=="M/F"] <- "Male/Female"
mass_shootings$gender[mass_shootings$gender==""] <- "Unknown"
# After Cleanup
table(mass_shootings$gender)
## 
##      Female        Male Male/Female     Unknown 
##           5         292           5          21
# nlevels(mass_shootings$gender) # should give us 4 if you want to double check
# sum(table(mass_shootings$gender))==nrow(mass_shootings) # Sanity check 
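The block above could equivalently be written as a single named-vector lookup (a base-R sketch, not the original approach):

```r
# Map each raw code to its label in one step
lookup <- c(M = "Male", F = "Female", `M/F` = "Male/Female")
g <- c("M", "F", "M/F", "")          # toy stand-in for mass_shootings$gender
recoded <- unname(lookup[g])         # codes not in the lookup become NA
recoded[is.na(recoded)] <- "Unknown" # treat anything unmatched as Unknown
recoded  # "Male" "Female" "Male/Female" "Unknown"
```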

Let's clean up the remaining categorical variables and aggregate similar values into bins that make more sense.

Mental Health Issues:

How many Mental Health cases do we know about shooters?

table(mass_shootings$mental_health_issues)
## 
##      No Unclear unknown Unknown     Yes 
##      93      13       1     110     106
mass_shootings$mental_health_issues[mass_shootings$mental_health_issues %in%c( "Unclear","unknown")] <-"Unknown"

What are the most common target places chosen by shooters?

# Target
# sort(xtabs(formula = ~target, data = mass_shootings), decreasing = TRUE) # Uncomment to see 
#Preprocessing Target
mass_shootings$target[mass_shootings$target == "random"] <- "Random" 
mass_shootings$target[mass_shootings$target == "NA"] <- "Unknown" 
mass_shootings$target[mass_shootings$target %in%c("Family","neighbors","Children","Family/Neighbors","Family+random","Friends","Family+students","partner's family","women")]<-"Friends & Family"
 
mass_shootings$target[mass_shootings$target %in% c("Coworkers","Ex-Coworkers","coworkers","Coworker's Family")]<-"Coworkers"
        
mass_shootings$target[mass_shootings$target  %in% c("Students","Students+Teachers","Teachers","school girls","Students+Parents","Family+students")]<-"School"
        
mass_shootings$target[mass_shootings$target %in% c("Ex-Wife","Ex-Wife & Family","Ex-Girlfriend","Ex-girlfriend","Ex-GirlFriend","Girlfriend","Ex-Girlfriend+random","Ex-Girlfriend & Family")]<-"Ex- GF/Wife"
         
mass_shootings$target[mass_shootings$target %in% c("Policeman","police","Marines","Policeman+Council Member", "TSA Officer","Trooper","Social Workers")]<-"Police/Trooper" 
         
mass_shootings$target[mass_shootings$target %in% c("party guests","uninvited guests","club members","birthday party bus")]<-"Parties"
          
mass_shootings$target[mass_shootings$target%in% c("Sikhs","monks","prayer group")]<-"Religious Group"

mass_shootings$target[mass_shootings$target%in% c("welding shop employees","hunters","lawyers","black men","Congresswoman","Contestant","drug dealer", "House Owner", "protestors","postmaster","rapper+random","psychologist+psychiatrist","basketball players")]<-"Other"
   
sort(table(mass_shootings$target), decreasing = TRUE)        
## 
##           Random Friends & Family           School        Coworkers 
##              140               50               38               32 
##      Ex- GF/Wife   Police/Trooper            Other          Parties 
##               17               14               13               11 
##                   Religious Group 
##                5                3

What kind of location, open or closed, do shooters prefer?

#Open_close_location

# sort(xtabs(formula = ~open_close_location, data = df_1), decreasing = TRUE)
mass_shootings$open_close_location[mass_shootings$open_close_location == "Open+CLose"] <- "Open+Close" 
sort(xtabs(formula = ~open_close_location, data = mass_shootings), decreasing = TRUE)
## open_close_location
##      Close       Open            Open+Close 
##        197         78         28         20
# Location
# Let's keep the location column by passing remove = FALSE
mass_shootings<- mass_shootings %>% separate(location , into = c("city", "state"), sep = ",", remove = FALSE) # split the location
## Warning: Expected 2 pieces. Additional pieces discarded in 4 rows [147, 176,
## 225, 241].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 46 rows [16, 17, 18, 19,
## 20, 21, 22, 23, 24, 25, 26, 30, 35, 36, 37, 38, 39, 41, 43, 44, ...].
mass_shootings <- mass_shootings %>% mutate(city = tolower(city)) # convert all city names to lower case

We also get a warning for 4 rows [147, 176, 225, 241]. Let's dive into those rows and see why.

# Lets look into those rows why it is generating error
mass_shootings[c(147, 176, 225, 241), c("location", "city", "state")]
##                                                       location         city
## 147 Pennsburg, Souderton, Lansdale, Harleysville, Pennsylvania    pennsburg
## 176                      South Valley, Albuquerque, New Mexico south valley
## 225                      Nickel Mines, Lancaster, Pennsylvania nickel mines
## 241                              Santee, San Diego, California       santee
##            state
## 147    Souderton
## 176  Albuquerque
## 225    Lancaster
## 241    San Diego

We can see that these 4 rows contain more than one comma, which is why separate() warned: it split on the first comma, put the first piece into city and the second into state, and discarded the rest. We can fix this by telling separate() to split on the last comma instead.

# Now we are splitting based on last comma to city and state.
mass_shootings<- separate(mass_shootings, location, c("city", "state"), sep = ",(?=[^,]+$)", remove=FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 46 rows [16, 17, 18, 19,
## 20, 21, 22, 23, 24, 25, 26, 30, 35, 36, 37, 38, 39, 41, 43, 44, ...].
mass_shootings[c(147, 176, 225, 241), c("location", "city", "state")]
##                                                       location
## 147 Pennsburg, Souderton, Lansdale, Harleysville, Pennsylvania
## 176                      South Valley, Albuquerque, New Mexico
## 225                      Nickel Mines, Lancaster, Pennsylvania
## 241                              Santee, San Diego, California
##                                             city         state
## 147 Pennsburg, Souderton, Lansdale, Harleysville  Pennsylvania
## 176                    South Valley, Albuquerque    New Mexico
## 225                      Nickel Mines, Lancaster  Pennsylvania
## 241                            Santee, San Diego    California
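The `,(?=[^,]+$)` pattern used above is a lookahead: it matches only a comma that is followed by no further commas, i.e. the last one. A base-R sketch of the same split on one of the problem rows:

```r
# Split on the LAST comma only: ",(?=[^,]+$)" matches a comma followed
# by a run of non-commas up to the end of the string
loc <- "South Valley, Albuquerque, New Mexico"
parts <- strsplit(loc, ",(?=[^,]+$)", perl = TRUE)[[1]]
city  <- parts[1]          # "South Valley, Albuquerque"
state <- trimws(parts[2])  # "New Mexico"
```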

Let's count how many incidents were recorded in each state:

sort(table(mass_shootings$state),decreasing = TRUE)
## 
##      California         Florida           Texas      Washington         Georgia 
##              29              20              16              14              13 
##         Arizona  North Carolina        New York            Ohio         Alabama 
##              11              11              10              10               9 
##        Illinois       Wisconsin    Pennsylvania        Colorado        Michigan 
##               9               9               8               6               6 
##        Kentucky          Nevada        Oklahoma  South Carolina       Tennessee 
##               5               5               5               5               5 
##        Virginia          Kansas       Louisiana   Massachusetts       Minnesota 
##               5               4               4               4               4 
##     Mississippi          Oregon     Connecticut        Missouri        Nebraska 
##               4               4               3               3               3 
##      New Jersey      New Mexico        Arkansas              CA         Montana 
##               3               3               2               2               2 
##            Utah        Virginia          Alaska              CO          Hawaii 
##               2               1               1               1               1 
##           Idaho         Indiana            Iowa              LA           Maine 
##               1               1               1               1               1 
##              MD              NV              PA    South Dakota          Texas  
##               1               1               1               1               1 
##              TX         Vermont              WA   West Virginia         Wyoming 
##               1               1               1               1               1

But we know that Washington D.C. is not considered a state, the city famously described as “Taxation Without Representation”. Let's find out which rows these are; we can ignore them or treat D.C. as a separate state just for the analysis.

unique((mass_shootings[is.na(mass_shootings$state),c("location", "city", "state")])[ ,1]) # 1 Is for location, 2 is for city.
## [1] ""                "Washington D.C."
mass_shootings %>% filter(location == "Washington D.C.")
##     s                title        location            city state       date
## 1 164 Washington Navy Yard Washington D.C. Washington D.C.  <NA> 2013-09-16
##   incident_area open_close_location target     cause
## 1                             Close Random terrorism
##                                                                                                                                                                                                                                                                                                                           summary
## 1 On September 16, 2013, a 34-yearl old contractor for the Navy Yard in Washington D.C, entered the building shooting randomly, killing twelve people and injuring three surviving victims before he was shot and killed by law enforcement officers.  Five other people were injured during the incident while trying to escape.
##   fatalities injured total_victims policeman_killed age employeed_y_n
## 1         13       3            15                0  34             1
##   employed_at mental_health_issues                               race gender
## 1   Navy Yard                  Yes Black American or African American   Male
##   latitude longitude year month month_name day Weekday dow Week
## 1 38.90481  -77.0163 2013     9        Sep  16  Monday   2   37

STATES

When we break down by state, which states have the most shootings?

# sort(table(mass_shootings$state),decreasing = TRUE)
# sum(is.na(mass_shootings$state)) # we have 46 Missing

p <-mass_shootings %>%
      filter(!is.na(state)) %>%
      group_by(state) %>%
      summarise(count = n()) %>%
      arrange(desc(count)) %>%
      ungroup() %>%
      mutate(state = reorder(state,count)) %>%
      head(10) %>%
      ggplot(aes(x = state,y = count)) +
      geom_bar(stat='identity',color="white", fill = "lightslategray") +
      geom_text(aes(x = state, y = 1, label = count),
              hjust=0, vjust=.5, size = 4, color = 'black',
              fontface = 'bold') +
    labs(x = 'States', 
         y = 'No. of Shootings', 
         title = 'Number of Shootings/State') +
       coord_flip() +
       theme_fivethirtyeight()
ggplotly(p)
# Percentage by state
perc_state <- mass_shootings %>%
              filter(!is.na(state)) %>% 
              group_by(state) %>%
              summarise(count = n()) %>%
              mutate(perc = round((count / sum(count)) * 100)) %>%
              arrange(desc(count)) %>% 
              head(10)
formattable(perc_state, align = c("l","r","r"), list(count = color_bar("mediumslateblue"), perc = color_tile("white","lightsalmon")))
state count perc
California 29 10
Florida 20 7
Texas 16 6
Washington 14 5
Georgia 13 5
Arizona 11 4
North Carolina 11 4
New York 10 4
Ohio 10 4
Alabama 9 3

California, Florida, and Texas have the most shootings.

Now we are left with empty values in location, and there are 45 of those. We could fill in the missing states using the latitude and longitude data available to us, for example by reverse geocoding the coordinates with Google's geocode API, but that requires an API key, so we leave them as-is.

Next, let's clean up the age variable and bucket shooters into age groups:

# Convert age variable to numeric
mass_shootings$age <- as.numeric(mass_shootings$age)
## Warning: NAs introduced by coercion
# Impute missing values of Age with mean value of non-missing ages
mass_shootings$age <- ifelse(is.na(mass_shootings$age),
                              mean(mass_shootings$age, na.rm = TRUE),
                              mass_shootings$age)

# Filter mass shootings into non-overlapping age groups
shootings_less_than_20 <- mass_shootings %>% filter(age < 20)
shootings_20_30 <- mass_shootings %>% filter(age >= 20, age < 30)
shootings_30_40 <- mass_shootings %>% filter(age >= 30, age < 40)
shootings_40_50 <- mass_shootings %>% filter(age >= 40, age < 50)
shootings_50_60 <- mass_shootings %>% filter(age >= 50, age < 60)
shootings_60_70 <- mass_shootings %>% filter(age >= 60, age < 70)
shootings_70_80 <- mass_shootings %>% filter(age >= 70, age < 80)
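The seven filters above can also be collapsed into a single cut() call with right-open bins (a base-R sketch):

```r
# cut() with right = FALSE gives right-open bins: [0,20), [20,30), ..., [70,80)
ages <- c(19, 20, 30, 47, 64, 79)
age_group <- cut(ages,
                 breaks = c(0, 20, 30, 40, 50, 60, 70, 80),
                 right = FALSE,
                 labels = c("<20", "20-29", "30-39", "40-49",
                            "50-59", "60-69", "70-79"))
as.character(age_group)  # "<20" "20-29" "30-39" "40-49" "60-69" "70-79"
```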

Decade

mass_shootings <- mass_shootings %>% 
  mutate(decade = (case_when(year >= 1960 & year < 1970 ~ "1960s", 
                             year >= 1970 & year < 1980 ~ "1970s", 
                             year >= 1980 & year < 1990 ~ "1980s", 
                             year >= 1990 & year < 2000 ~ "1990s", 
                             year >= 2000 & year < 2010 ~ "2000s", 
                             year >= 2010 & year < 2020 ~ "2010s")))
decade_boxplot <- mass_shootings %>% 
  ggplot(aes(x =decade, y = age, fill = decade)) +
  geom_boxplot() + 
  labs(x = "Decade", title = "Age Distribution of Mass Shooters by Decade") +
  theme_fivethirtyeight()
ggplotly(decade_boxplot)
## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53
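Since each decade label is just the year floored down to a multiple of ten, the case_when() above can also be computed arithmetically (a base-R sketch):

```r
# Decade label via integer arithmetic: 1966 -> "1960s", 2017 -> "2010s"
years <- c(1966, 1984, 1999, 2017)
paste0(floor(years / 10) * 10, "s")  # "1960s" "1980s" "1990s" "2010s"
```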

Race

p <-mass_shootings %>%
       filter(!is.na(race)) %>%
        group_by(race) %>%
        summarise(count = n()) %>%
        arrange(desc(count)) %>%
        mutate(race = reorder(race,count)) %>% 
        ungroup() %>%
        ggplot(aes(x = race,y = count)) +
        geom_bar(stat='identity',color="white", fill = "#D8BFD8") +
        geom_text(aes(x = race, y = 10, label = count),
              hjust=0, vjust=.5, size = 4, color = 'black',
              fontface = 'bold') +
        coord_flip() +
       labs(x = 'Race', 
            y = 'No. of Shootings', 
            title = 'Number of Shootings/Race')+
  theme_fivethirtyeight()
       
ggplotly(p)

Percentage by race

# Percentage by race
perc_race <- mass_shootings%>%
              group_by(race) %>%
              summarise(count=n()) %>%
              mutate(perc=round((count/sum(count))*100)) %>%
              arrange(desc(count))

formattable(perc_race,align=c("l","r","r"),list(count=color_bar("mediumslateblue"),perc=color_tile("white","lightsalmon")))
race count perc
White American or European American 144 45
Black American or African American 85 26
Unknown 44 14
Other 24 7
Asian or Asian American 18 6
Latino 5 2
Native American or Alaska Native 3 1
# Pie chart
shooter <- data.frame(table(mass_shootings$race))
colnames(shooter) <- c("race","freq")
ggplot(shooter,aes(x=race,
                   y=freq,
                   fill=race))+
  theme(legend.position="none")+
  geom_bar(stat="identity",width = 1)+
  coord_polar(theta = "y")

The statistic shows the number of mass shootings in the United States between 1982 and November 19, 2018, by race and ethnicity of the shooter(s). Between 1982 and November 2018, 60 out of 107 mass shootings were initiated by White shooters. The Las Vegas strip massacre in 2017 had the highest number of victims between 1982 and 2018, with 58 people killed, and over 500 injured.

The outermost ring represents White American or European American shooters.

Is it fair to say that shooters of other races also participate in mass shootings besides Whites?

Build and Evaluate the Logistic Regression Model

First, let's load the caret package and set a seed for reproducibility; we will then build the modeling dataset and split it into training and testing sets.

# Load the caret package
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Set seed for reproducibility
set.seed(6496)

(The modeling frame `incident_data` is constructed in the next step, so the train/test split is performed there.)

Train the Model

Now, let’s train the logistic regression model using the training data.

incident_data <- mass_shootings

# Convert target variable to binary (1 for incidents with fatalities, 0 otherwise)
incident_data$fatal_incident <- ifelse(incident_data$fatalities > 0, 1, 0)

# Select relevant features for the model
features <- c("age", "gender", "race", "mental_health_issues")

# Filter out rows with missing values in the selected features
incident_data <- na.omit(incident_data[, c(features, "fatal_incident")])

# Train-test split
set.seed(123)  # for reproducibility
train_indices <- sample(nrow(incident_data), 0.7 * nrow(incident_data))  # 70% train, 30% test
train_data <- incident_data[train_indices, ]
test_data <- incident_data[-train_indices, ]

# Train the logistic regression model
logistic_model <- glm(fatal_incident ~ ., data = train_data, family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Summary of the model
summary(logistic_model)
## 
## Call:
## glm(formula = fatal_incident ~ ., family = binomial(link = "logit"), 
##     data = train_data)
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             -5.496e+00  8.330e+03  -0.001  0.99947
## age                                      1.436e-01  4.469e-02   3.214  0.00131
## genderMale                               2.122e+01  2.509e+03   0.008  0.99325
## genderMale/Female                        3.989e+01  1.463e+04   0.003  0.99782
## genderUnknown                            1.966e+01  2.509e+03   0.008  0.99375
## raceBlack American or African American  -1.971e+01  7.943e+03  -0.002  0.99802
## raceLatino                              -2.246e+00  1.582e+04   0.000  0.99989
## raceNative American or Alaska Native     1.929e+01  1.643e+04   0.001  0.99906
## raceOther                               -1.710e+01  7.943e+03  -0.002  0.99828
## raceUnknown                             -1.918e+01  7.943e+03  -0.002  0.99807
## raceWhite American or European American -1.748e+01  7.943e+03  -0.002  0.99824
## mental_health_issuesUnknown              5.359e-01  5.591e-01   0.958  0.33784
## mental_health_issuesYes                  3.738e+01  3.482e+03   0.011  0.99143
##                                           
## (Intercept)                               
## age                                     **
## genderMale                                
## genderMale/Female                         
## genderUnknown                             
## raceBlack American or African American    
## raceLatino                                
## raceNative American or Alaska Native      
## raceOther                                 
## raceUnknown                               
## raceWhite American or European American   
## mental_health_issuesUnknown               
## mental_health_issuesYes                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 173.20  on 225  degrees of freedom
## Residual deviance: 109.91  on 213  degrees of freedom
## AIC: 135.91
## 
## Number of Fisher Scoring iterations: 20
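The `glm.fit` warning above deserves a note: it signals (quasi-)complete separation, where some factor levels predict the outcome perfectly. Their coefficients then diverge toward plus or minus infinity, which is why several dummy estimates sit near 20-40 with enormous standard errors and uninformative Wald p-values. A quick illustration with base R's `plogis` (the inverse-logit function):

```r
# On the logit scale, a coefficient near 21 already saturates the probability,
# which is exactly what the warning "fitted probabilities numerically 0 or 1
# occurred" is reporting.
plogis(21.22)           # ~1 (the genderMale estimate from the summary above)
plogis(-5.50 + 21.22)   # still ~1 even after adding the intercept
```

Penalized fits are a common remedy for separation, e.g. Firth's bias-reduced logistic regression (the `brglm2` package) or ridge/lasso via `glmnet`; these are suggestions, not part of the analysis above.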
# Predict on test data
predicted_probs <- predict(logistic_model, newdata = test_data, type = "response")

# Convert predicted probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)

# Test the model

# Compute accuracy
accuracy <- mean(predicted_classes == test_data$fatal_incident)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9072165
# Compute confusion matrix
conf_matrix <- table(Actual = test_data$fatal_incident, Predicted = predicted_classes)
print(conf_matrix)
##       Predicted
## Actual  0  1
##      0  3  9
##      1  0 85
# Compute precision, recall, and F1-score
precision <- conf_matrix[2, 2] / sum(predicted_classes)
recall <- conf_matrix[2, 2] / sum(test_data$fatal_incident)
f1_score <- 2 * precision * recall / (precision + recall)
cat("Precision:", precision, "\n")
## Precision: 0.9042553
cat("Recall:", recall, "\n")
## Recall: 1
cat("F1-score:", f1_score, "\n")
## F1-score: 0.9497207
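The three metrics can also be packaged into a small helper; the `classification_metrics` name is illustrative, and the toy vectors below reproduce the confusion matrix printed above (85 true positives, 9 false positives, 0 false negatives, 3 true negatives):

```r
# Derive accuracy, precision, recall, and F1 from labels and predictions,
# treating class 1 ("fatal") as the positive class.
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(actual == positive & predicted == positive)
  fp <- sum(actual != positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  tn <- sum(actual != positive & predicted != positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  list(accuracy  = (tp + tn) / length(actual),
       precision = precision,
       recall    = recall,
       f1        = 2 * precision * recall / (precision + recall))
}

actual    <- c(rep(1, 85), rep(0, 9), rep(0, 3))
predicted <- c(rep(1, 85), rep(1, 9), rep(0, 3))
m <- classification_metrics(actual, predicted)
m$accuracy   # 0.9072165
m$precision  # 0.9042553
m$recall     # 1
m$f1         # 0.9497207
```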
# Plot ROC curve
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_curve <- roc(test_data$fatal_incident, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve", col = "blue")

auc_value <- auc(roc_curve)
cat("AUC:", auc_value, "\n")
## AUC: 0.9132353
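The AUC has a useful probabilistic reading: it is the chance that a randomly chosen fatal incident receives a higher predicted probability than a randomly chosen non-fatal one. A minimal sketch (the `manual_auc` name is illustrative) computes this directly, counting ties as half:

```r
# AUC as the Wilcoxon statistic: compare every positive score to every
# negative score; ">" counts 1, a tie counts 0.5.
manual_auc <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

manual_auc(c(1, 1, 0, 0), c(0.9, 0.4, 0.4, 0.1))  # 0.875
```

On the test data this would agree with `pROC::auc` above; the toy call just makes the pairwise-comparison interpretation concrete.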

# Evaluate the model performance
confusion_matrix <- table(predicted_classes, test_data$fatal_incident)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Print the confusion matrix and accuracy
print(confusion_matrix)
##                  
## predicted_classes  0  1
##                 0  3  0
##                 1  9 85
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.907216494845361"

Logistic regression equation

$P = \dfrac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$

# Get the coefficients of the logistic regression model

coefficients <- coef(logistic_model)
print(coefficients)
##                             (Intercept)                                     age 
##                              -5.4960585                               0.1436353 
##                              genderMale                       genderMale/Female 
##                              21.2201803                              39.8907200 
##                           genderUnknown  raceBlack American or African American 
##                              19.6624353                             -19.7136216 
##                              raceLatino    raceNative American or Alaska Native 
##                              -2.2455087                              19.2921908 
##                               raceOther                             raceUnknown 
##                             -17.1013968                             -19.1823710 
## raceWhite American or European American             mental_health_issuesUnknown 
##                             -17.4790179                               0.5358593 
##                 mental_health_issuesYes 
##                              37.3805239
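The printed coefficients can be plugged into the equation above by hand. The profile below is a hypothetical one (a 25-year-old male with reported mental health issues, race at its reference level), with coefficient values copied from the printout:

```r
# Manually apply the logistic regression equation for one illustrative profile.
b0     <- -5.4960585   # (Intercept)
b_age  <-  0.1436353   # age
b_male <- 21.2201803   # genderMale
b_mhi  <- 37.3805239   # mental_health_issuesYes

eta <- b0 + b_age * 25 + b_male + b_mhi   # linear predictor
p   <- plogis(eta)                        # same as exp(eta) / (1 + exp(eta))
p  # ~1: the probability saturates, again reflecting the separation warning
```

This reproduces what `predict(logistic_model, type = "response")` does internally for a single row.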